[1] "Beautiful day"
[1] NA
[1] "Beautiful NA"
[1] "Beautiful day"
[1] NA
[1] "Beautiful NA"
[1] 5 5
[1] 5 5
Error in nchar(f): 'nchar()' requires a character vector
[1] 4 4 8 3
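The error above is what nchar() raises when it is handed a factor instead of a character vector. A base-R sketch with illustrative names:

```r
f <- factor(c("John", "Paul"))    # a factor stores integer codes, not text
# nchar(f)                        # errors: 'nchar()' requires a character vector
nchar(as.character(f))            # convert to character first: 4 4

nchar(c("John", "Paul", "Beatles", "Pam"))   # per-element lengths: 4 4 7 3
```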
[1] "Bruc" "Wayn"
[1] "uce" "yne"
[1] FALSE TRUE TRUE
[1] "pepperoni" "sausage and green peppers"
[1] 0 1 1
[1] 3 2 5
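Detecting, subsetting, and counting pattern matches produce outputs of exactly this shape. A base-R sketch with an illustrative pizza vector:

```r
pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

grepl("pepper", pizzas)                # detect: FALSE TRUE TRUE
grep("pepper", pizzas, value = TRUE)   # subset: only the matching strings

# count matches per string: gregexpr() locates every match,
# regmatches() extracts them, lengths() counts them
lengths(regmatches(pizzas, gregexpr("pepper", pizzas)))   # 0 1 1
```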
str_split(): pull apart raw string data into more useful variables
[[1]]
[1] "23.01.2017" "29.01.2017"
[[2]]
[1] "30.01.2017" "06.02.2017"
[1] "192" "118" "001"
[1] "510.555.0123" "541.555.0167"
"10202"?"102a"? What about in this "1O2"?"2,32.1,0.4"!grep - global regex print. Is there a patern in a string?grepl - returns logical value. \ | ( ) [ { ^ $ * + ?\- escape character. - any (just one) character^ - begining of a string$ - end of string[1] 1 2 5
[1] 1
[1] 2
[1] 2 3 4
[1] FALSE TRUE TRUE TRUE FALSE FALSE
[1] 1 5 6
[1] TRUE FALSE FALSE FALSE TRUE TRUE
[1] 1 6
[1] 2 3 4 5
[1] 3
[1] 1 2 4 5 6
[1] 1 2 4
[1] 1 2
[1] 4 7 8
|  - or (alternation)
() - group
[1] 1 2 3 5 6
[1] 1 2 3
[1] 1 2 3 6
?     - matches at most 1 time
*     - matches 0 or more times
+     - matches 1 or more times
{m}   - matches exactly m times
{m,n} - matches between m and n times
{m,}  - matches at least m times
[1] 2 3 4 5 6
[1] 3 4 5 6
[1] "ab" "acb"
[1] "accb"
[1] "accb" "acccb" "accccb"
[1] "accb" "acccb"
[1] 1 2
[1] 1 7
[1] 2 4
[1] "<EM>first</EM>"
[1] "<EM>"
[[1]]
[1] 17 46
attr(,"match.length")
[1] 3 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[1]]
[1] "100" "45"
[1] "bell pepper" "blood orange" "canary melon"
[4] "chili pepper" "goji berry" "kiwi fruit"
[7] "purple mangosteen" "rock melon" "salal berry"
[10] "star fruit" "ugli fruit"
[1] "100"
[[1]]
[1] "100" "45"
1. Collection of text documents
2. Pre-processing of text
3. Text mining techniques
4. Analyze the text
5. Knowledge discovery
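Step 2 (pre-processing) typically starts with tokenization, which turns documents into a tidy one-word-per-row table like the one below. A base-R sketch of the idea (tidytext's unnest_tokens() is the usual tool; the two example sentences are illustrative, chosen to mirror the output):

```r
docs <- c("Great white shark just ate my leg",
          "Not a wonderful day and days")

tokens <- strsplit(tolower(docs), "[^a-z']+")   # lowercase, split on non-letters
tidy <- data.frame(id   = rep(seq_along(docs), lengths(tokens)),
                   word = unlist(tokens))
nrow(tidy)   # 13 -- one row per (document, word) pair
```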
# A tibble: 13 × 2
id word
<int> <chr>
1 1 great
2 1 white
3 1 shark
4 1 just
5 1 ate
6 1 my
7 1 leg
8 2 not
9 2 a
10 2 wonderful
11 2 day
12 2 and
13 2 days
Stop words are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.); they add little information to the text.
A few examples of English stop words: “the”, “a”, “an”, “so”, “what”, …
Why remove stop words? To strip the low-level information from the text so that the important information gets more focus.
Do we always remove stop words? NO!
Before removing stop words, research your task and the problem you are trying to solve a bit, and then make your decision!
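Removing stop words is then just a filtering join against a stop-word list, which shrinks the 13-row token table to the 7 rows below. A base-R sketch (the stop-word vector here is a tiny illustrative subset; tidytext's stop_words table with anti_join() is the usual route):

```r
words <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  word = c("great", "white", "shark", "just", "ate", "my", "leg",
           "not", "a", "wonderful", "day", "and", "days"))

stops <- c("great", "just", "my", "not", "a", "and")   # illustrative subset
kept  <- words[!words$word %in% stops, ]
kept$word   # "white" "shark" "ate" "leg" "wonderful" "day" "days"
```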
# A tibble: 7 × 2
id word
<int> <chr>
1 1 white
2 1 shark
3 1 ate
4 1 leg
5 2 wonderful
6 2 day
7 2 days
# A tibble: 13 × 2
id word
<int> <chr>
1 1 great
2 1 white
3 1 shark
4 1 just
5 1 at
6 1 my
7 1 leg
8 2 not
9 2 a
10 2 wonder
11 2 dai
12 2 and
13 2 dai
# A tibble: 13 × 2
id word
<int> <chr>
1 1 great
2 1 white
3 1 shark
4 1 just
5 1 eat
6 1 my
7 1 leg
8 2 not
9 2 a
10 2 wonderful
11 2 day
12 2 and
13 2 day
There are many R packages for sentiment analysis; the output below comes from sentimentr:
[[1]]
[1] "The hotel is ideally located and is in a beautiful building."
[2] "Most of the staff are very polite and helpful."
[3] "Rooms are comfortable and it has a serviceable gym."
[4] "Avoid going to breakfast before 0700 or wearing flip flops or slippers, you will be admonished and sent back to your room to change."
[[2]]
[1] "The hotel is a short walk to the pedestrian mall, restaurants and cafes."
[2] "The hotel is an old historical landmark."
[3] "I loved the tall ceilings, lobby and restaurant."
[4] "The bathroom has been updated and is very nice."
[5] "The breakfast buffet is very good with many options and you can eat outside."
[6] "We enjoyed our stay here."
attr(,"class")
[1] "get_sentences" "get_sentences_character"
[3] "list"
element_id word_count sd ave_sentiment
1: 1 52 0.3861717 0.2993681
2: 2 56 0.2501910 0.2914671
element_id word_count sd ave_sentiment
1: 1 52 0.1763900 0.14263246
2: 2 56 0.1420433 0.02832483
element_id word_count sd ave_sentiment
1: 1 8 NA -0.03535534
element_id word_count sd ave_sentiment
1: 1 8 NA -0.04787702
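The tables above are sentence-level sentiment summaries (sentimentr-style: an average polarity score per element, scaled by word count). The core idea, sketched in base R with a tiny illustrative polarity lexicon (the words and weights are made up for the example):

```r
# toy lexicon: +1 for positive words, -1 for negative ones (illustrative)
lexicon <- c(beautiful = 1, polite = 1, helpful = 1, comfortable = 1,
             good = 1, loved = 1, avoid = -1, admonished = -1)

# score a sentence: summed polarity of its words, scaled by sqrt(word count)
score <- function(sentence) {
  w <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  sum(lexicon[w], na.rm = TRUE) / sqrt(length(w))
}

score("Most of the staff are very polite and helpful")   # positive score
score("Avoid going to breakfast before 0700")            # negative score
```

Real packages refine this with valence shifters ("not good", "very good"), which is why sentimentr first splits text into sentences with get_sentences().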
[1] "2019-05-05 18:51:32 CEST"
[1] "2019-05-05 18:51:32 CEST"
[1] "2019-05-05 18:51:32 UTC"
[1] "2024-01-17"
[1] "2024-01-17"
[1] "2024-01-17 11:37:35 CET"
[1] "2024-01-17 11:37:35 CET"
[1] "2017-01-31"
[1] NA
[1] "2017-01-31"
[1] "2018-10-17"
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 08:01:00 UTC"
[1] "2007-11-05"
[1] "2007-11-05 15:07:00 UTC"
[1] "2024-01-17 UTC"
[1] "2024-01-17"
[1] 2019
[1] 5
[1] 5
[1] 125
[1] 1
[1] 19
[1] 23
[1] 13
[1] May
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
[1] Sunday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
Time difference of 17660 days
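"Time difference of … days" lines are difftime objects, produced by subtracting one date from another. A base-R sketch:

```r
d1 <- as.Date("2017-01-31")
d2 <- as.Date("2017-01-01")

d1 - d2                            # Time difference of 30 days
difftime(d1, d2, units = "days")   # same, with explicit units
as.numeric(d1 - d2)                # 30, as a plain number
```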
[1] "1525824000s (~48.35 years)"
[1] "15s"
[1] "600s (~10 minutes)"
[1] "43200s (~12 hours)" "86400s (~1 days)"
[1] "0s" "86400s (~1 days)" "172800s (~2 days)"
[4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
[1] "1814400s (~3 weeks)"
[1] "31557600s (~1 years)"
[1] "15S"
[1] "10M 0S"
[1] "12H 0M 0S" "24H 0M 0S"
[1] "7d 0H 0M 0S"
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
[5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
[1] "21d 0H 0M 0S"
[1] "1y 0m 0d 0H 0M 0S"
[1] "2019-10-19 06:00:00 UTC"
[1] "1111d 10H 37M 36.0438408851624S"
dataNYC$pickup_datetime <- ymd_hms(dataNYC$pickup_datetime)
[1] "2008-09-28"
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
[1] "01-03-18"
[1] "01-Mrz-2018"
[1] "01-März-18"
[1] "Mrz 01, 2018"
[1] "März 01, 2018"
[1] Donnerstag
7 Levels: Sonntag < Montag < Dienstag < Mittwoch < Donnerstag < ... < Samstag
[1] ""
Use Sys.getlocale() and Sys.setlocale() to:
In this exercise you will work with the date “1930-08-30”, Warren Buffett’s birth date! Mind the locale language!
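A base-R sketch for this kind of exercise (which locale names exist is OS-dependent, e.g. "German" on Windows versus "de_DE.UTF-8" on Linux, so the switch is done defensively and the old locale is restored):

```r
birth <- as.Date("1930-08-30")    # Warren Buffett's birth date

old <- Sys.getlocale("LC_TIME")   # remember the current locale
# try to switch LC_TIME to German; Sys.setlocale() returns "" if unavailable
loc <- suppressWarnings(Sys.setlocale("LC_TIME", "German"))

format(birth, "%A, %d. %B %Y")    # weekday/month names follow LC_TIME

Sys.setlocale("LC_TIME", old)     # restore the original locale
```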
uros.godnov@gmail.com